Data Analytics for Finance
Hands-on Practice
In Assignment 3, you’ll apply these concepts: run OLS regressions, test assumptions, and create publication-quality tables.
| Concept | Meaning | Example |
|---|---|---|
| Data‑Generating Process (DGP) | The complete set of rules that determine how the data you observe are created | Newton’s law of gravity |
| Variation | The differences in a variable’s value across observations. | How people’s incomes differ by hair color across individuals. |
| Identification | The process of ensuring that the variation you exploit is causal, not a spurious alternative | Chocolate consumption and Nobel laureates (this is spurious) |
Identification is the bridge between:
- Theory: assumes that observations are produced by underlying laws or processes
- Data: used to identify which parts of the DGP explain the data and to refine research designs
- Assumptions help block non-essential variation, focusing on the aspects of the data that align with the research question.
- Part of the DGP is understood (this helps form assumptions); part is unknown (the area of exploration).
- Research relies on the known aspects of the DGP to interpret data and identify genuine causal effects.
- Techniques like controlling for variables, subgroup analyses, or hypothetical scenarios help test specific segments of the DGP.
- Continued research and empirical testing refine and validate assumptions about parts of the DGP.
When we try to estimate causal effects, we face a fundamental challenge:
Endogeneity occurs when the explanatory variable (X) is correlated with the error term (ε) in our model.
This correlation can arise from several sources… let’s focus on two of the most common:
Selection bias occurs when individuals self-select into treatment based on characteristics that also affect the outcome.
Omitted variable bias occurs when a relevant variable that affects both the explanatory variable (X) and the outcome (Y) is left out of the analysis.
A variable Z causes OVB if it affects the outcome (Y), is correlated with the explanatory variable (X), and is left out of the model.
The connection
Selection bias is one cause of OVB. When people self-select based on characteristic Z (which also affects Y), and we omit Z from our model, we have OVB.
Key insight
If ability affects both LLM use and exam scores, any naive comparison of LLM users vs. non-users will be confounded.
Endogeneity can bias our estimates, in the worst case even flipping their sign, so naive comparisons no longer recover the causal effect.
We often cannot directly observe the confounding variables!
Let’s see this problem in action with a concrete example…
Disclaimer
I created this example (incl. the data) for educational purposes only. It does not represent real individuals or actual events.
Research question
What is the effect of LLM use on students’ exam scores?
But…
Is this effect causal? Or are there confounding factors at play?
| Variable | LLM | No LLM | Difference | t-stat. | p-value |
|---|---|---|---|---|---|
| Exam score | 7.04 | 6.55 | 0.50 | -5.99 | 0.00 |
| Attendance rate | 74.15 | 75.43 | -1.28 | 1.42 | 0.15 |
| Female | 0.49 | 0.50 | -0.01 | 0.39 | 0.70 |
| Age | 21.37 | 21.34 | 0.03 | -0.18 | 0.86 |
| Study hours | 20.09 | 20.34 | -0.25 | 0.35 | 0.73 |
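A comparison table like this comes from two-sample t-tests on each variable. A minimal sketch on simulated data (the group sizes and moments below are invented for illustration, not the course dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical exam scores for the two groups (illustrative numbers only)
llm = rng.normal(7.0, 1.3, 486)
no_llm = rng.normal(6.5, 1.3, 514)

# Difference in means and the corresponding two-sample t-test
diff = llm.mean() - no_llm.mean()
t_stat, p_val = stats.ttest_ind(llm, no_llm)
print(f"Difference: {diff:.2f}, t = {t_stat:.2f}, p = {p_val:.3f}")
```

A small p-value on the outcome row says the means differ, but says nothing about *why* — that is exactly the identification question.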
Do you have any ideas? What could be part of the data generating process that we do not observe in the data?
What happens if we could observe and condition on (or control for) ability?
| Group | LLM (mean score) | No LLM (mean score) | Difference |
|---|---|---|---|
| Low ability | 5.83 | 6.13 | -0.31 |
| High ability | 7.45 | 7.72 | -0.26 |
| Group | LLM (n) | No LLM (n) | Difference |
|---|---|---|---|
| Low ability | 122 | 379 | -257 |
| High ability | 364 | 135 | 229 |
Key insight
\[ \text{LLM users: }\frac{122 \times 5.83 + 364 \times 7.45}{486} \approx 7.04 \] \[ \text{Non-users: }\frac{379 \times 6.13 + 135 \times 7.72}{514} \approx 6.55 \]
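The reversal is pure arithmetic (Simpson's paradox): the group-level means are weighted averages over the ability cells, and the weights differ across groups. Checking the numbers from the tables above:

```python
# Weighted averages over the ability cells, using the counts and means
# from the slide's tables
llm_users = (122 * 5.83 + 364 * 7.45) / (122 + 364)
non_users = (379 * 6.13 + 135 * 7.72) / (379 + 135)
print(round(llm_users, 2), round(non_users, 2))  # 7.04 6.55
```

Within each ability group LLM users score *lower*, yet the pooled means favor LLM users, because high-ability students are over-represented among users.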
LLMs hurt everyone — we were just fooled by who chose to use them.
This is Selection Bias
The groups differed systematically on ability. High-ability students selected into LLM use — and they would have done well anyway.
| Analysis | LLM Coefficient | Interpretation |
|---|---|---|
| Naive comparison | Positive | LLMs help! |
| Control for ability | Negative | LLMs hurt! |
This is a textbook case of selection bias manifesting as omitted variable bias:
Key takeaway
Naive comparisons can be misleading due to endogeneity issues like selection bias and OVB!
Critical insight
Omitted variable bias doesn’t just make estimates imprecise—it can completely flip the sign of your estimate!
\[\text{Naive estimate} = \text{True effect} + \text{Bias}\] \[\text{Positive} = \text{Negative} + \text{(Large) Positive}\]
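A tiny simulation can illustrate this decomposition. All numbers below are invented: the DGP gives LLM use a negative true effect, but higher-ability students select into use, so the naive difference in means comes out positive:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
ability = rng.normal(0, 1, n)
# Hypothetical DGP: high-ability students are more likely to use the LLM,
# and the true effect of LLM use on scores is -0.3
llm = (ability + rng.normal(0, 1, n) > 0).astype(float)
score = 6.5 - 0.3 * llm + 0.8 * ability + rng.normal(0, 0.5, n)

# Naive estimate: raw difference in means, ignoring ability
naive = score[llm == 1].mean() - score[llm == 0].mean()
print(f"True effect: -0.30, naive estimate: {naive:.2f}")
```

The naive estimate equals the true effect plus the selection-driven bias term, and here the bias is large enough to flip the sign.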
- Controls variation in the causal variable (e.g., LLM use) via random assignment
- Ensures that the treatment and control groups are similar along observable and unobservable dimensions
- The only difference between the two groups is the treatment
- This allows us to attribute any difference in outcomes to the treatment
- No selection bias (no endogeneity issue)
Types of experiments:
But…
Did random assignment work? Are the groups balanced on observable & unobservable characteristics?
| Variable | LLM | No LLM | Difference | t-stat. | p-value |
|---|---|---|---|---|---|
| Exam score | 6.43 | 7.22 | -0.80 | 8.75 | 0.00 |
| Attendance rate | 74.62 | 75.01 | -0.39 | 0.43 | 0.66 |
| Female | 0.49 | 0.49 | 0.00 | -0.02 | 0.98 |
| Age | 21.29 | 21.43 | -0.13 | 0.91 | 0.36 |
| Study hours | 19.92 | 20.56 | -0.64 | 0.89 | 0.37 |
| Ability | 0.01 | 0.03 | -0.02 | 0.33 | 0.74 |
Linear: \[ Y = \beta_0 + \beta_1 X. \]
“Linear‑in‑parameters” (e.g., quadratic): \[ Y = \beta_0 + \beta_1 X + \beta_2 X^2. \]
In the multiple regression \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon, \] the coefficient on \(X\) is estimated using only the variation in \(X\) that is left after regressing \(Z\) out of both \(X\) and \(Y\) — that is, you “control for” \(Z\).
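This "residualize, then regress" equivalence is the Frisch–Waugh–Lovell theorem. A sketch on simulated data (no real dataset assumed) verifying that the two routes give the same coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)      # x is correlated with the control z
y = 2.0 * x + 1.5 * z + rng.normal(size=n)

# Route 1: full regression of y on [1, x, z]
A = np.column_stack([np.ones(n), x, z])
beta_full = np.linalg.lstsq(A, y, rcond=None)[0]

# Route 2: residualize x and y on [1, z], then regress residual on residual
B = np.column_stack([np.ones(n), z])
rx = x - B @ np.linalg.lstsq(B, x, rcond=None)[0]
ry = y - B @ np.linalg.lstsq(B, y, rcond=None)[0]
beta_fwl = (rx @ ry) / (rx @ rx)

print(beta_full[1], beta_fwl)  # the two estimates coincide
```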
| Assumption | What it requires | Role |
|---|---|---|
| Linearity | \(E[Y \mid X] = \beta_0 + \beta_1 X\) | Correct specification of functional form |
| Exogeneity | \(E[\varepsilon \mid X] = 0\) | Ensures \(\hat{\beta}\) is unbiased and consistent |
| Homoscedasticity | \(Var(\varepsilon \mid X) = \sigma^2\) | OLS is efficient (BLUE) and the usual standard-error formulas are valid |
| Independence | Observations are i.i.d. | Standard error formulas are valid |
| Normality (small samples) | \(\varepsilon \sim N(0, \sigma^2)\) | Validates t-tests and confidence intervals |
| Assumption | Plain English |
|---|---|
| Linearity | The relationship between X and Y is a straight line |
| Exogeneity | X is not correlated with anything else that affects Y |
| Homoscedasticity | The spread of errors is the same across all values of X |
| Independence | One observation doesn’t influence another |
| Normality (small samples) | Errors follow a bell curve—only matters for hypothesis testing with few observations |
| Assumption | If violated | How to check |
|---|---|---|
| Linearity | Estimates are biased; wrong model | Plot residuals vs fitted values |
| Exogeneity | Estimates are biased and inconsistent | Theory and DAGs; no direct test |
| Homoscedasticity | Standard errors are wrong (often too small) | Plot residuals vs fitted; Breusch-Pagan test |
| Independence | Standard errors are wrong | Durbin-Watson test; check data structure |
| Normality | t-tests and CIs invalid in small samples | Q-Q plot of residuals |
| Issue | If present | How to check |
|---|---|---|
| Multicollinearity | Estimates unbiased but imprecise; unstable coefficients | Correlation matrix; VIF > 10 |
| Outliers | Single observations can distort estimates | Summary stats; plots |
| Measurement error in X 1 | Coefficient biased toward zero (attenuation bias) | Theory; compare multiple measures if available |
| Small sample size | Estimates imprecise; normality assumption matters more | Check n relative to number of predictors |
| Missing data | Bias if not missing completely at random | Check patterns; compare complete vs incomplete cases |
| Component | Meaning | Typical notation |
|---|---|---|
| Rows (Coefficient, Standard Error) | Estimate of \(\beta_j\) and its SE (or t‑stat) | -0.021 (0.004) |
| Significance stars | Indicates p‑value thresholds | *** = p < .01 |
| \(R^2\) | Proportion of variance in \(Y\) explained by the model | 0.065 |
| Adjusted \(R^2\) | Corrects \(R^2\) for number of predictors | 0.065 |
| F‑statistic | Joint test that all non‑constant coefficients equal zero | 35.016 |
| RMSE | Standard deviation of residuals (or “root mean squared error”) | 1.307 |
Precision: Given SE=0.004, even small differences in coefficients are detectable
Significance: *** indicates the coefficient differs from zero at the 1% level
Model fit: \(R^2\) of 0.065 tells us only ~ 6.5% of the variation in scores is captured by the predictors
t-stat: The slope is about 5 standard errors away from zero (0.021/0.004 ≈ 5), indicating strong evidence against the null hypothesis of no effect
Typically, we focus on the coefficient estimates, their precision (SEs), and significance levels to interpret OLS results.
- The coefficient on `llm` in the naive OLS regression is equivalent to the difference in means between LLM users and non-users
- Adding `ability` further refines the estimate by accounting for this key confounder and gives us the “true” effect of LLM use

Important caveat
In DiLLMa, we could observe ability and control for it. In real research, we usually cannot observe all confounders.
| What we did in DiLLMa | What happens in reality |
|---|---|
| Observed ability | Ability is unobserved |
| Controlled for it in OLS | Can’t control for what we don’t see |
| Got the “true” effect | Estimate remains biased |
When we can’t run experiments AND controlling for observables isn’t enough, we need quasi-experimental methods:
We’ll continue the DiLLMa story with panel data!
Thank You for Your Attention!
See You in the Next One!
| Variable | LLM | No LLM | Difference | t-stat. | p-value |
|---|---|---|---|---|---|
| Exam score | 7.04 | 6.55 | 0.50 | -5.99 | 0.00 |
| Attendance rate | 74.15 | 75.43 | -1.28 | 1.42 | 0.15 |
| Female | 0.49 | 0.50 | -0.01 | 0.39 | 0.70 |
| Age | 21.37 | 21.34 | 0.03 | -0.18 | 0.86 |
| Study hours | 20.09 | 20.34 | -0.25 | 0.35 | 0.73 |
| Ability | 0.56 | -0.50 | 1.07 | -20.11 | 0.00 |
The t-test measures how far the sample mean is from the value hypothesized under \(H_0\), in units of standard errors
Hypothesis: \(H_0\): the two group means are equal; \(H_1\): the means differ
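The statistic can be computed by hand and checked against `scipy` (equal-variance two-sample test; the data below are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(7.0, 1.3, 100)  # hypothetical group 1 scores
b = rng.normal(6.5, 1.3, 100)  # hypothetical group 2 scores

# Pooled-variance t-statistic for H0: equal means
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_scipy, _ = stats.ttest_ind(a, b)  # same formula under the default equal_var=True
print(t_manual, t_scipy)
```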